{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M7.03\n",
    "\n",
    "As with the classification metrics exercise, we will evaluate the regression\n",
    "metrics within a cross-validation framework to get familiar with the syntax.\n",
    "\n",
    "We will use the Ames house prices dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "ames_housing = pd.read_csv(\"../datasets/house_prices.csv\")\n",
    "data = ames_housing.drop(columns=\"SalePrice\")\n",
    "target = ames_housing[\"SalePrice\"]\n",
    "data = data.select_dtypes(np.number)\n",
    "target /= 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step will be to create a linear regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "model = LinearRegression()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, use the `cross_val_score` to estimate the generalization performance of\n",
    "the model. Use a `KFold` cross-validation with 10 folds. Make the use of the\n",
    "$R^2$ score explicit by assigning the parameter `scoring` (even though it is\n",
    "the default score)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "scores = cross_val_score(model, data, target, cv=10, scoring=\"r2\")\n",
    "print(f\"R2 score: {scores.mean():.3f} \u00b1 {scores.std():.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, instead of using the $R^2$ score, use the mean absolute error (MAE). You\n",
    "may need to refer to the documentation for the `scoring` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "scores = cross_val_score(\n",
    "    model, data, target, cv=10, scoring=\"neg_mean_absolute_error\"\n",
    ")\n",
    "errors = -scores\n",
    "print(f\"Mean absolute error: {errors.mean():.3f} k$ \u00b1 {errors.std():.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "The `scoring` parameter in scikit-learn expects score. It means that the\n",
    "higher the values, and the smaller the errors are, the better the model is.\n",
    "Therefore, the error should be multiplied by -1. That's why the string given\n",
    "the `scoring` starts with `neg_` when dealing with metrics which are errors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, use the `cross_validate` function and compute multiple scores/errors\n",
    "at once by passing a list of scorers to the `scoring` parameter. You can\n",
    "compute the $R^2$ score and the mean absolute error for instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "scoring = [\"r2\", \"neg_mean_absolute_error\"]\n",
    "cv_results = cross_validate(model, data, target, scoring=scoring)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "scores = {\n",
    "    \"R2\": cv_results[\"test_r2\"],\n",
    "    \"MAE\": -cv_results[\"test_neg_mean_absolute_error\"],\n",
    "}\n",
    "scores = pd.DataFrame(scores)\n",
    "scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "In the Regression Metrics notebook, we introduced the concept of loss function,\n",
    "which is the metric optimized when training a model. In the case of\n",
    "`LinearRegression`, the fitting process consists in minimizing the mean squared\n",
    "error (MSE). Some estimators, such as `HistGradientBoostingRegressor`, can\n",
    "use different loss functions, to be set using the `loss` hyperparameter.\n",
    "\n",
    "Notice that the evaluation metrics and the loss functions are not necessarily\n",
    "the same. Let's see an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from collections import defaultdict\n",
    "from sklearn.ensemble import HistGradientBoostingRegressor\n",
    "\n",
    "scoring = [\"neg_mean_squared_error\", \"neg_mean_absolute_error\"]\n",
    "loss_functions = [\"squared_error\", \"absolute_error\"]\n",
    "scores = defaultdict(list)\n",
    "\n",
    "for loss_func in loss_functions:\n",
    "    model = HistGradientBoostingRegressor(loss=loss_func)\n",
    "    cv_results = cross_validate(model, data, target, scoring=scoring)\n",
    "    mse = -cv_results[\"test_neg_mean_squared_error\"]\n",
    "    mae = -cv_results[\"test_neg_mean_absolute_error\"]\n",
    "    scores[\"loss\"].append(loss_func)\n",
    "    scores[\"MSE\"].append(f\"{mse.mean():.1f} \u00b1 {mse.std():.1f}\")\n",
    "    scores[\"MAE\"].append(f\"{mae.mean():.1f} \u00b1 {mae.std():.1f}\")\n",
    "scores = pd.DataFrame(scores)\n",
    "scores.set_index(\"loss\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "Even if the score distributions overlap due to the presence of outliers in the\n",
    "dataset, it is true that the average MSE is lower when `loss=\"squared_error\"`,\n",
    "whereas the average MAE is lower when `loss=\"absolute_error\"` as expected.\n",
    "Indeed, the choice of a loss function is made depending on the evaluation\n",
    "metric that we want to optimize for a given use case.\n",
    "\n",
    "If you feel like going beyond the contents of this MOOC, you can try different\n",
    "combinations of loss functions and evaluation metrics.\n",
    "\n",
    "Notice that there are some metrics that cannot be directly optimized by\n",
    "optimizing a loss function. This is the case for metrics that evolve in a\n",
    "discontinuous manner with respect to the internal parameters of the model, as\n",
    "learning solvers based on gradient descent or similar optimizers require\n",
    "continuity (the details are beyond the scope of this MOOC).\n",
    "\n",
    "For instance, classification models are often evaluated using metrics computed\n",
    "on hard class predictions (i.e. whether a sample belongs to a given class)\n",
    "rather than from continuous values such as\n",
    "[`predict_proba`](https://scikit-learn.org/stable/glossary.html#term-predict_proba)\n",
    "(i.e. the estimated probability of belonging to said given class). Because of\n",
    "this, classifiers are typically trained by optimizing a loss function computed\n",
    "from some continuous output of the model. We call it a \"surrogate loss\" as it\n",
    "substitutes the metric of interest. For instance `LogisticRegression`\n",
    "minimizes the `log_loss` applied to the `predict_proba` output of the model.\n",
    "By minimizing the surrogate loss, we maximize the accuracy. However\n",
    "scikit-learn does not provide surrogate losses for all possible classification\n",
    "metrics."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}